This note serves as an overview of some of the algorithms used in Bayesian statistics, starting with an introduction to markov chains and Monte Carlo methods; the two fundamental components that make up a Markov Chain Monte Carlo (MCMC) process.
A random walk pretty much does what it says on the tin. It’s a collection of random steps that, when plotted, represent a random process. An often used example is a particle moving in one dimension. With equal probability the partical moves up or down at time \(t\) relative to it position at time \(t-1\). This can be represented with a Bernoulli probability mass function. Another example would be to let \(\varepsilon ~ \mathcal{N}(0,1)\) and define, \[ X_n = X_{n-1} + \varepsilon \] which gives us a continuous random walk.
The code below computes a discrete random walk in two dimensions with equal probabilities being asigned to each movement: up, down, left, or right.
# RANDOM WALK
loc <- c("up", "down", "left", "right")
dat <- matrix(NA, ncol = 2, nrow = 1e5/2)
dat[1,] <- c(0,0)
for(i in 2:nrow(dat)) {
rand <- sample(loc, 1, prob = rep(0.25,4))
if(rand == "up") {
dat[i,] <- c(0,1) + dat[i-1,]
}
else if(rand == "down") {
dat[i,] <- c(0,-1) + dat[i-1,]
}
else if(rand == "left") {
dat[i,] <- c(1,0) + dat[i-1,]
}
else if(rand == "right") {
dat[i,] <- c(-1,0) + dat[i-1,]
}
else {
dat[i,] <- c(NA,NA)
}
}Below, the figure on the left plots the first 5,000 steps of the random walk, and the figure on the right plots all 50,000 steps of the walk with the first 5,000 steps highlighted. The darker regions indicate the parts of the process that walked over itself.
Consider a sequence of random variables \(X_0, X_1, X_2, \ldots\) where \(X_n\) denotes the state at time \(n \geq 0\). If the sequence is a Markov chain then the the probability of \(X_n\) depends only on \(X_{n-1}\). Thus, we can write the conditional probability as, \[ P(X_{n} = j | X_{n-1} = i) \] where \(j\) and \(i\) are values from the state space. The conditional probability above is the transition probability of moving from state \(i\) in time \(n-1\) to state \(j\) in time \(n\).
The mathematical statement above is best summarized with the Markov property, which states that only the most recent term \(X_{n-1}\) of all the past values \(X_{n-1}, X_{n-2}, X_{n-3}, \ldots, X_0\) influences the prediction of the future state \(X_n\). This also means that \(X_n\) and \(X_{n-2}, X_{n-3}, \ldots, X_0\) are conditionally independent. One useful feature of Markov chains is that it saves us from having to use the entire history to predict the future value.
To simulate a Markov chain we need to compute the probability of moving from one state to the next. These probabilities can be encoded in a transition matrix, the \((i,j)\) element of which defines the probability of moving to state \(j\) from state \(i\) for all \(i,j = 1, 2, \ldots, K\)$.A transition matrix is square since, given the present state, it must describe the probability of moving to each possible state. Additionally, the probabilities of moving to each state are disjoint (e.g. if our states are directions then it’s not possible to both move up and down). As a result the rows of the transition matrix sum to one (i.e. the states represent all the events in the sample space.)
Note that
Now we want to describe how to actually make the move from state \(n-1\) to state \(n\). Let the \(1 \times K\) vector \(\vec{v}_n\) denote the probabilites of the state at \(n\) and let \(M\) denote the \(K \times K\) transition matrix. Then the marginal distribution of each state is given applying the law of total probability in the following way, \[ \vec{v}_{n} = \vec{v}_{n-1} M \] Each element \(j\) of \(\vec{v}_n\), is equal to \(\sum_{i=1}^{K} (\vec{v}_{n-1})_i \cdot M_{i,j}\). In other words the \(j^{th}\) element of \(\vec{v}_n\) is the dot product of \(\vec{v}_{n-1}\) and the \(j^{th}\) column of \(M\).
Intutively this makes sense. We start with a vector of probabilities describing our previous state and a transition matrix describing the move to the next state. Each column of the transition matrix defines a \(K\) vector of probabilities associated with moving to a specific state \(j\) given each of the previous states the chain could be at \(i = 1, 2, \ldots, K\). Then, applying the law of total probability, the sum of the product of the probability of being at each previous state \(i\) and the associated probability of moving to the next state (given by the \((i,j)\) element of the transition matrix) will give the marginal probability of the next state \(j\).
Similar to the random walk above, we simulate two Markov chains moving through two dimensional space with the movements: up, down, left, and right.
The first Markov chain emulates a random walk and uses the following transition matrix, \[ M = \begin{bmatrix} 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ \end{bmatrix} \]
where each column of \(M\) denotes the state probabilities for moving to state \(j\) at time \(n\) given the probabilities at state \(i \in (\mbox{up, down, left, right})\) at time \(n-1\), and each row \(M\) denotes the probability of moving to each new state \(j \in (\mbox{up, down, left, right})\) given state \(i\). This is a Markov chain as a random walk since equal weight has been put on moving from state \(i\) to state \(j\) for all \(j \in (\mbox{up, down, left, right})\). For example, given that the chain’s previous state was up the probability of moving up is \(M_{1,1} = 0.25\) which is equal to the probability of moving down \(M_{1,2} = 0.25\) (and is also equal to the probability of moving left/right).
The second Markov chain uses the following transition matrix, \[ M = \begin{bmatrix} 0 & 0.9 & 0 & 0.1\\ 0.9 & 0 & 0.1 & 0\\ 0.1 & 0 & 0 & 0.9\\ 0 & 0.1 & 09 & 0\\ \end{bmatrix} \] This type of transition matrix encourages the Markov chain to move in the opposite direction compared to its previous state. For example, if our previous state was moving up then (going across the top row of \(M\)) the next state is up with probability 0, down with probability 0.9, left with probability 0, and right with probability 0.1. From this point we will most likely move down. If so then (going across the second row of \(M\)) we move back up with probability 0.9 and left with probability 0.1. But there is also a 0.1 chance that we could move right after moving up. Then (going across the last row) we would move up with probability 0, down with probability 1, left with probability 0.9, and right with probability 0.
The function that we run the Markov chain through is provided below.
# MARKOV CHAIN FUNCTION
#' Construct a Markov Chain governing movement through 2D space
#' @param transition_matrix A symmetric matrix of probabilities. Rows must sum to one.
#' @param iter Number of steps to take over the 2D space.
#' @param init Initial position in the 2D space.
#' @param state Initial state to start the Markov Chain.
#' @return The probabilities (prob) that are used to calculate the movement (move) and the data (dat) that
#' describes the movemement over 2D space.
markov_chain <- function(transition_matrix, iter = 1e3, init = c(0,0), state = c(1,0,0,0)) {
# setup containers
dat <- array(NA, c(iter, 2))
prob <- array(NA, c(iter,4))
move <- array(NA, iter)
# initial parameterizations
loc <- colnames(transition_matrix)
dat[1,] <- init
# loop to contstruct the chain
for(i in 2:iter) {
state <- state%*%transition_matrix
rand <- sample(loc, 1, prob = state)
prob[i,] <- c(state)
move[i] <- c(rand)
if(rand == "up") {
dat[i,] <- c(0,1) + dat[i-1,]
}
else if(rand == "down") {
dat[i,] <- c(0,-1) + dat[i-1,]
}
else if(rand == "left") {
dat[i,] <- c(-1,0) + dat[i-1,]
}
else if(rand == "right") {
dat[i,] <- c(1,0) + dat[i-1,]
}
else {
dat[i,] <- c(NA,NA)
}
}
# return containers as a list
return(list("dat" = dat, "prob" = prob, "move" = move))
}Now we run the simulations using each transition matrix with the initial condition \(v_0 = (0, 0, 1, 0)\).
# MARKOV CHAIN (as a random walk)
trans_prob <- matrix(0.25, ncol = 4, nrow = 4)
colnames(trans_prob) <- row.names(trans_prob) <- c("up", "down", "left", "right")
iter <- 1e5
mc_rw <- markov_chain(trans_prob, iter = iter)
# MARKOV CHAIN (with a negating transition matrix)
trans_prob <- rbind(c(0, 0.9, 0, 0.1),
c(0.9, 0, 0.1, 0),
c(0.1, 0, 0, 0.9),
c(0, 0.1, 0.9, 0))
colnames(trans_prob) <- row.names(trans_prob) <- c("up", "down", "left", "right")
mc <- markov_chain(trans_prob, iter = iter)Below are the probabilities and associated move (via sampling from the binomial distribution) of the first 10 steps of the negating Markov chain.
## x y up down left right move
## 1 0 0 NA NA NA NA <NA>
## 2 0 -1 0.0000000 0.9000000 0.00000000 0.10000000 down
## 3 0 0 0.8100000 0.0100000 0.18000000 0.00000000 up
## 4 0 -1 0.0270000 0.7290000 0.00100000 0.24300000 down
## 5 0 0 0.6562000 0.0486000 0.29160000 0.00360000 up
## 6 0 -1 0.0729000 0.5909400 0.00810000 0.32806000 down
## 7 0 0 0.5326560 0.0984160 0.35434800 0.01458000 up
## 8 1 0 0.1240092 0.4808484 0.02296360 0.37217880 right
## 9 1 1 0.4350599 0.1488262 0.38304576 0.03306816 up
## 10 2 1 0.1722481 0.3948607 0.04464396 0.38824718 right
A Monte Carlo method is typically used as a way to use random sampling to numerically solve a probablem that is not analytically tractable (e.g. the area under a a curve). The algorithm proceeds as follows,
ALGORITHM (Monte Carlo Method)
Using Monte carlo methods we can estimate the mathematical constant \(\pi\) which is equal to ratio of the circumference to the diameter. Consider a unit circle inscribed in a square. The are of the circle is \(\pi\) and the area of the square is 4, so the ratio of the area of the circle and the square is \(\pi/4\). Now all we need to do is construct a Monte Carlo method to estimate this ratio and multiply the result by 4 to get an estimate of \(\pi\).
Below is the code that runs through the steps above to estimate \(\pi\).
circle <- function(x) {
sqrt(1 - x^2)
}
domain <- seq(0,1, length.out = 1e3)
iter <- 1e4 * 2
accept <- matrix(NA, ncol = 2, nrow = iter)
reject <- matrix(NA, ncol = 2, nrow = iter)
for(i in 1:iter) {
x <- runif(1, 0, 1)
y <- runif(1, 0, 1)
if(x^2 + y^2 <= 1)
accept[i,] <- c(x,y)
else
reject[i,] <- c(x,y)
}
cat('estimate of pi = ', (nrow(na.omit(accept)) / iter) * 4)## estimate of pi = 3.143
The figure below illustrates the accepted and rejected values.
The accept-reject algorithm is a Monte Carlo method that enables you to sample from a probability distribution. Let \(f\) be the target distribution (i.e. the distribution you want to get samples from) and let \(g\) be the candidate distribution (i.e. the distribution you use to randomly sample out of the domain of interest). We need these distributions to statisfy two conditions. First, both \(g\) and \(f\) need to be defined over the same support. Second, from some value in the support \(f(x)/g(x) \leq M\) for some constant \(M\). The algorithm proceeds as follows,
ALGORITHM (Rejection Sampling)
# MONTE CARLO - REJECTION SAMPLING
#' A Rejection Sampling algorithm where proposals are generated from Unif(0,1).
#' @param f Target distribution.
#' @param g Candidate distribution.
#' @param support Support of the target distribution.
#' @param M Bound on the likelihood ratio f/g.
#' @param iter Number of iterations to run the algorithm through (iter = accept + reject).
#' @return A list containing accepted/rejected values.
accept_reject <- function(f, g, support = c(0,1), M = 1, iter = 1e3) {
out <- list()
for (i in 1:iter) {
# containers
y <- runif(1, support[1], support[2]) # proposal
u <- runif(1, 0, 1) # stochastic condition
out$y[i] <- y
out$u[i] <- u
# likelihood ratio
ratio <- f(y)/(M*g(y, support[1], support[2]))
# accept-reject condition
if(ratio > u)
out$accept[i] <- y
else
out$reject[i] <- y
}
return(out)
}The code below applies rejection sampling to sample from the beta distribution using different values of \(M\).
beta_sample1 <- accept_reject(f = function(x){dbeta(x, 2, 2)},
g = function(x, a, b){dunif(x, a, b)},
M = 1, iter = 1e4*4)
beta_sample2 <- accept_reject(f = function(x){dbeta(x, 2, 2)},
g = function(x, a, b){dunif(x, a, b)},
M = 2, iter = 1e4*4)The table below describes the total number of accepted (TRUE) and rejected (FALSE) values in each sampling procedure.
## M=1 M=2
## FALSE 7674 20049
## TRUE 32320 19951
Notice that with a low value of \(M\) we ended up accepting too many values. In fact, what ended up happening is that we over accepted where \(f\) has more mass. This is apparent in the figure below which plots the accepted and rejected values. When \(M=1\), the empirical beta distribution getting squashed around its expected value. With a more adequate value for \(M\) (such as \(M=2\)) we get a better approximation of the true beta distribution. (For comparison purposes, the figure on the far left plots samples from the \(Beta(2,2)\) distribution using the R function rbeta().)
If we state our acceptance rule as \(u \leq f(y)/(Mg(y)) \implies u(Mg(y)) \leq f(y)\) then why this is happening should be clear. If \(M_1 < M_2\) then for any given \(u\) and \(y\), \(u(M_1g(y)) < u(M_2g(y))\) which means that the condition \(u(M_1g(y)) \leq f(y)\) will end up accepting \(y\) more than \(u(M_2g(y)) \leq f(y)\).
Now we want to combine these two aforementioned ideas of Markov chains and Monte Carlo methods. If we use a Markov chain to move through some mathematical space and sample according to some Monte Carlo method then what we have is called Markov chain Monte Carlo (MCMC). The [Metropolis algorithm] is a simple case of this. Again, we have a target distribution \(f\) that we are interested in sampling from and a symmetric candidate distribution \(g\) that we generate proposals from. Let the index \(n\) denote the current position of the chain. The algorithm proceeds as follows,
ALGORITHM (Metropolis)
The code below creates a function that runs the Metropolis algorithm. The user can declare the number of iterations and the number of chains. Both are important inputs. The number of iterations ensures that we have a large enough sample from the distribution, and running chains with different initial values allows us to determine whether the samples converge. If they do not converge then we cannot perform any inferences on the sample since it does not adequately represent the distribution. Sometimes this can be remedied by increasing the number of iterations. In this
# METROPOLIS ALGORITHM
#' A Rejection Sampling algorithm where proposals are generated from Unif(0,1).
#' @param f Target distribution.
#' @param iter Number of iterations to run the algorithm through.
#' @param chains Number of chains.
#' @param prop_scale Value of tuning parameter.
#' @return An array of iterations by chains.
metropolis <- function(f, iter = 1e3, chains = 4, prop_scale = 1) {
init <- rnorm(chains, 0, 20)
out <- array(NA, c(iter, chains))
out[1,] <- init
# define a function to sample (i.e. accpet/reject)
sampler <- function(x) {
alpha <- 0
u <- 1
while(alpha < u) {
u <- runif(1, 0, 1) # stochastic criterion
x_star <- rnorm(1, x, 1 * prop_scale) # proposal distribution
alpha <- exp(f(x_star)-f(x)) # acceptance ratio
}
return(x_star)
}
# run the sampler
for(i in 2:iter) {
for(j in 1:chains) {
out[i,j] <- sampler(out[i-1,j]) # accepting x* or x_n
}
}
out
}Before running the algorithm we need to define the posterior distribution that we want to sample from which is done in the code below.
dat <- rnorm(1, rnorm(10, 5, 5), 1)
post_func <- function(x) {
lik <- sum(dnorm(dat, x, 1, log = T))
prior <- dnorm(x, 5, 1, log = T)
lik + prior
}Mathematically we can state our model as, \[ y \sim N(\mu, 1) \\ \mu \sim N(5,1) \]
We run the algorithm and look at the median of the data and the median of the samples of \(\mu\).
metro <- metropolis(f = post_func, chains = 4, iter = 2e3)
rbind("median of data" = median(dat),
"median of met samples" = median(metro))## [,1]
## median of data 11.100220
## median of met samples 8.046592
It looks like we got closer to the true \(\mu\) compared to the maximum likelihood estimate of \(\mu\) (which would have been approximately equal to the mean of the data).
In the figure below the plot on the top left illustrates the full length of all four chains. Although the chains start in disparate locations, they converge pretty quickly. This is not always the case, especially in complicated models. The initial steps are useful only insofar as they eventually take us to most of the mass of the distribution. We should discard most of these initial steps since they are not really exploring much of the mass distribution (the part that we find interesting). It is conventionally to regard the first half of each chain as a warmup (or burnin) for the algorithm and conduct inferences on the last half. Next, the plot on the top right presents the first 100 iteration of each chain and we can see that the chains converge within the first 100 iterations. The plot on the bottom left illustrates the last half of each chain. Finally, the plot on bottom right represents a histogram of the sampled values of \(\mu\).
The Metropolis-Hastings algorithm is similar to the Metropolis algorithm, however, we now relax the assumption that the proposal distribution \(g\) has to be symmetric. The algorithm proceeds in the same was except that now our accept/reject condition is \(u < \frac{f(x^*)}{f(x_n)}\frac{g(x_n|x^*)}{g(x^*|x_n)}\). The component \(frac{g(x_n|x^*)}{g(x^*|x_n)}\) is included in the condition to correct for the asymmetry in the proposal distribution. The benefit of this is efficiency since, in some cases, using an asymmetric proposal distribution can increase the speed of the Markov chain.
The code below creates a Metropolis-Hastings function that allows you to sample from a bivariate distribution, where \(g\) is the bivariate normal distribution.
# METROPOLIS-HASTINGS ALGORITHM
#' A Rejection Sampling algorithm where proposals are generated from Unif(0,1).
#' @param f Target distribution.
#' @param iter Number of iterations to run the algorithm through.
#' @param chains Number of chains.
#' @param init Initial values for each chain.
#' @param prop_scale Value for tuning parameter.
#' @return An array of iterations by parameters by chains.
metropolis_hastings <- function(f, iter = 1e3, chains = 4, init = NULL, prop_scale = 1) {
out <- array(NA, c(iter, 2, chains))
# setup initial values
for(p in 1:2) {
if(is.null(init))
out[1,p,] <- rnorm(chains, 0, 1)
else
out[1,p,] <- init[,p]
}
prop_covmat <- diag(1,2)
# create sampler
sampler <- function(x) {
alpha <- 0
u <- 1
reject <- -1
while(alpha < u) {
reject <- reject + 1
u <- runif(1, 0, 1)
x_star <- rmvnorm(1, x, prop_covmat*prop_scale)
alpha <- exp((f(x_star)-f(x)) +
(dmvnorm(x, x_star, prop_covmat, log = TRUE) - dmvnorm(x_star, x, prop_covmat, log = TRUE)))
# cat("alpha = ", alpha, "; ", "u = ", u, "\n")
}
return(list("post" = x_star, "reject" = reject))
}
# run sampler for each chain
for(j in 1:chains) {
cat(":: chain", j, "::","")
reject <- 0
for(i in 2:iter) {
sampler_stuff <- sampler(out[i-1,,j])
reject <- reject + sampler_stuff$reject
out[i,,j] <- sampler_stuff$post # important! the rest is just print aesthetics
if(i%%(iter/10) == 0)
cat(".")
if(i == iter)
cat("", "accept rate =", iter,"/", reject + iter, "=", round(iter/(reject+iter),4), "\n")
}
}
out
}The code below creates the data and runs the Metropolis-Hastings algorithm. Our posterior distribution that we are interested in sampling from can be stated as, \[ y \sim \mathcal{MN}(\vec{\mu}, \mathbf{\Sigma}) \\ \vec{\mu}_1 \sim \mathcal{N}(0, 0.1) \\ \vec{\mu}_2 \sim \mathcal{N}(0, 0.1) \\ \]
Where \(\Sigma\) is a diagonal matrix with ones on the diagonal.
library(mvtnorm)
# parameters and data
covmat <- diag(1,2)
N <- 5
priors <- cbind(rnorm(N, 0, 3), rnorm(N, 0, 3))
dat <- array(NA, c(N, 2))
for(i in 1:N) {
dat[i,] <- rmvnorm(1, mean = priors[i,], sigma = covmat)
}
# initial values for algorithm
init_val <- rbind(c(1,1),
c(-1,-1),
c(1,-1),
c(-1,1))
init_val <- init_val * 4
# posterior distribution
posterior_func <- function(pars) {
lik <- sum(dmvnorm(dat, pars, sigma = covmat, log = TRUE))
prior <- dnorm(pars, 0, 0.1, log = TRUE)
lik + sum(prior)
}
# run algorithm
metro_hasti <- metropolis_hastings(f = posterior_func,
iter = 2e3, chains = 4, init = init_val,
prop_scale = 0.005)## :: chain 1 :: .......... accept rate = 2000 / 3059 = 0.6538
## :: chain 2 :: .......... accept rate = 2000 / 3040 = 0.6579
## :: chain 3 :: .......... accept rate = 2000 / 3124 = 0.6402
## :: chain 4 :: .......... accept rate = 2000 / 3065 = 0.6525
Below we have the means of each parameter using the data and the sampled values. Again, we get a lot closer to the true parameter values with the Bayesian approach, compared to the maximum likelihood approach.
## mu1 mu2
## mean data -0.88561791 -0.72070321
## mean posterior -0.04652705 -0.03576534
Now we have examine the convergence of the chains. The plot below looks at convergence in the two dimensional space. (We used starting values in each corner of a square for illustrative purposes.)
The figure below illustrates the traceplots for each chain with the warmup shaded in grey.
The figure below illustrates the samples for each parameter using histograms.
The Metropolis and Metropolis-Hasting algorithms are sensitive to tuning. Above we used a fairly inefficient tuning parameter prop_scale = 0.005 (this is apparent in the large number of accepted proposals). It is inefficient in the sense that it over accepted proposals out in the tails. (In this example we know that the tails are not a particularly interesting part of the distribution since our likelihood is multivariate normal and our priors are normal.) This isn’t a huge problem it we run a large number of iterations, but that is not always feasible. Below we rerun the model with a more efficient value for the tuning parameter.
metro_hasti_tuned <- metropolis_hastings(f = posterior_func,
iter = 2e3, chains = 4, init = init_val,
prop_scale = 0.1)## :: chain 1 :: .......... accept rate = 2000 / 13018 = 0.1536
## :: chain 2 :: .......... accept rate = 2000 / 13263 = 0.1508
## :: chain 3 :: .......... accept rate = 2000 / 13362 = 0.1497
## :: chain 4 :: .......... accept rate = 2000 / 13371 = 0.1496
If we plot the first 100 iterations, notice that the chains with the inefficient tuning parameter did not even get a chance to mix under the center mass of the distribution. However, with better tuning there is already a lot of mixing taking place. Remember, the whole point of sampling from an unknown distribution is to get to the interesting part of the distribution and explore it.
The Hamiltonion Monte Carlo algorithm is a concept borrowed from Physics that does a more efficient job of exploring the target distribution. It does this by supressing the random walk behavior in Metropolis-Hastings. First, we need to define a few of new components.
In order to do HMC we need the gradient of the posterior distribution with respect to the parameters. If we have a function that takes several inputs then the gradient is the vector of first order deriviatives, where each element in the gradient is the derivative of the function with respect to each parameter. Formally, say we have a function that takes \(J\) parameters and outputs a scalar value, \(f: \mathbb{R}^J \rightarrow \mathbb{R}\). Then the gradient of \(f\) is \(\triangledown f = \bigg(\frac{df}{d\theta_1},\frac{df}{d\theta_2},\ldots,\frac{df}{d\theta_J}\bigg)\).
For each unknown parameter \(\theta_j\) we also need a momementum variable \(\phi_j\). Typically \(\vec{\phi}\) has a multivariate normal distribution with mean zero and covariance matrix \(M\). Often \(M\) is a diagonal matrix so that the elements of \(\vec{\phi}\) are independent of each other. Tuning of the algorithm involves multipling \(M\) by some scalar quanity \(\varepsilon\) that the user can initialize. We say initialize since the algorithm can be setup so that the tuning parameter adapts as the chain explores the target distribution.
ALGORITHM (Hamiltonian Monte Carlo)
The code below creates a function that runs the HMC algorithm using the bivariate normal distribution as the proposal distribution \(g\).
# HAMILTONIAN MONTE CARLO
#' A Rejection Sampling algorithm where proposals are generated from Unif(0,1).
#' @param f Target distribution.
#' @param iter Number of iterations to run the algorithm through.
#' @param chains Number of chains.
#' @param init_val initial values for each chain.
#' @param epsilon_init inital value for tuning parameter.
#' @param M symmetric mass matrix for momentum parameter distribution.
#' @param L Number of leapfrog steps.
#' @return An array of iterations by parameters by chains.
hmc <- function(f, iter = 1e3, chains = 4, init_val = NULL, epsilon_init = 0.1, M = diag(1,2), L_0 = 10) {
out <- array(NA, c(iter, 2, chains))
phi <- array(NA, c(iter, 2, chains))
if(is.null(init_val))
out[1,,] <- t(rmvnorm(chains, c(0,0), diag(1,2)))
else
out[1,,] <- init_val
phi[1,,] <- rmvnorm(chains, c(0,0), M)
gradient <- function(theta) {
e <- 1e-4
out <- array(NA, length(theta))
for(t in 1:length(theta)) {
# browser()
theta_hi <- theta_lo <- theta
theta_hi[t] <- theta[t] + e
theta_lo[t] <- theta[t] - e
out[t] <- (f(theta_hi) - f(theta_lo)) / (2 * e)
# browser()
}
out
}
sampler <- function(theta) {
phi <- c(rmvnorm(1, mean = c(0,0), sigma = M))
grad <- gradient(theta)
epsilon <- runif(1, 0, 2 * epsilon_init)
L <- ceiling(2 * L_0 * runif(1,0,1))
for (l in 1:L) {
phi <- phi + 0.5 * epsilon * grad
theta <- theta + epsilon * solve(M) %*% phi
phi <- phi + 0.5 * epsilon * grad
}
return(list("theta" = c(theta), "phi" = c(phi)))
}
accept_reject <- function(theta_star, theta, phi_star, phi) {
dens_theta_star <- f(theta_star)
dens_theta <- f(theta)
dens_phi_star <- dmvnorm(phi_star, c(0,0), M, log = TRUE)
dens_phi <- dmvnorm(phi, c(0,0), M, log = TRUE)
alpha <- exp((dens_theta_star + dens_phi_star) - (dens_theta + dens_phi))
alpha
}
cat(paste0("['.' = ", iter/10, " iterations]\n"))
for(j in 1:chains) {
cat(paste0("[ chain ", j, " ]"))
for(i in 2:iter) {
alpha <- 0
u <- 1
while(alpha < u) {
u <- runif(1, 0, 1)
sample <- sampler(out[i-1,,j])
alpha <- accept_reject(sample$theta, out[i-1,,j], sample$phi, phi[i-1,,j])
}
phi[i,,j] <- sample$phi
out[i,,j] <- sample$theta
if(i%%(iter/10) == 0)
cat(".")
if(i == iter)
cat("[ complete ]\n")
}
}
out
}The code below sets up the parameters, creates the data, and specifies the posterior distribution function we want to sample from.
# parameters
covmat <- rbind(c(1,0.9),c(0.9,1))
N <- 1e3
priors <- cbind(rnorm(N, 0, 1), rnorm(N, 0, 1))
# priors <- cbind(rep(0,N), rep(0,N)) # fixed
# data
dat <- array(NA, c(N, 2))
for(i in 1:N) {
dat[i,] <- rmvnorm(1, mean = priors[i,], sigma = covmat)
}
# posterior distribution
post_func <- function(theta) {
lik <- sum(dmvnorm(dat, theta, covmat, log = TRUE))
prior <- dnorm(theta, 0, 0.1, log = TRUE)
return(lik + sum(prior))
}We are modeling the data as being generated from the bivariate normal distriubtion with known covariance matrix \(\Sigma\), with standard normal priors on the means. Formally, \[ y \sim \mathcal{MN}({\vec{\theta}, \Sigma}) \\ \theta_1 \sim \mathcal{N}(0, 1) \\ \theta_2 \sim \mathcal{N}(0, 1) \\ \]
# setup initial conditions
init_val <- cbind(c(1,1),
c(-1,-1),
c(1,-1),
c(-1,1))
init_val <- init_val * 4
# run HMC
hmc_sim <- hmc(f = post_func, iter = 1e3, init_val = init_val, epsilon_init = ((1/15)^2), M = diag(1,2), L_0 = 10)## ['.' = 100 iterations]
## [ chain 1 ]..........[ complete ]
## [ chain 2 ]..........[ complete ]
## [ chain 3 ]..........[ complete ]
## [ chain 4 ]..........[ complete ]
We can compare the means from the data and from the parameters sampled using HMC.
mean_table <- rbind("mean data" = colMeans(dat), "mean posterior" = rowMeans(colMeans(hmc_sim)))
colnames(mean_table) <- c("mu1", "mu2")
mean_table## mu1 mu2
## mean data -0.01420323 -0.002619256
## mean posterior -0.00108730 -0.022669300
We had a lot of data and not such a complicated model so it makes sense that (what would be) the maximum likelihood estimate is close to the posterior means for each location parameter.
The figure below plots the traceplot of both parameters.
The figure below illustrates the covergence of the chains in two dimensions. A contour plot of the actual posterior distribution is overlayed. Notice that all of the mixing is taking place in the center mass of the distribution.
Blitzstein, J. K. and Hwang, J. (2014) Introduction to Probaability. Chapman & Hall/CRC Press, New York.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013) Bayesian Data Analysis, 3rd Edition. Chapman & Hall/CRC Press, London.
Robert, C. P. and Casella, G. (2010) Introducing Monte Carlo Methods with R. Springer, New York.